AITopics | coherent risk measure

Several authors have recently developed risk-sensitive policy gradient methods that augment the standard expected cost minimization problem with a measure of variability in cost. These studies have focused on specific risk-measures, such as the variance or conditional value at risk (CVaR). In this work, we extend the policy gradient method to the whole class of coherent risk measures, which is widely accepted in finance and operations research, among other fields. We consider both static and time-consistent dynamic risk measures. For static risk measures, our approach is in the spirit of policy gradient algorithms and combines a standard sampling approach with convex programming. For dynamic risk measures, our approach is actor-critic style and involves explicit approximation of value function. Most importantly, our contribution presents a unified approach to risk-sensitive reinforcement learning that generalizes and extends previous results.

name change, policy gradient, risk measure, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Add feedback

Bayesian Risk-Sensitive Policy Optimization For MDPs With General Loss Functions

Wang, Xiaoshuang, Lin, Yifan, Zhou, Enlu

arXiv.org Artificial IntelligenceOct-2-2025

Motivated by many application problems, we consider Markov decision processes (MDPs) with a general loss function and unknown parameters. To mitigate the epistemic uncertainty associated with unknown parameters, we take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional (with respect to the Bayesian posterior distribution) on the loss. Since this formulation usually does not satisfy the interchangeability principle, it does not admit Bellman equations and cannot be solved by approaches based on dynamic programming. Therefore, We propose a policy gradient optimization method, leveraging the dual representation of coherent risk measures and extending the envelope theorem to continuous cases. We then show the stationary analysis of the algorithm with a convergence rate of $\mathcal{O}(T^{-1/2}+r^{-1/2})$, where $T$ is the number of policy gradient iterations and $r$ is the sample size of the gradient estimator. We further extend our algorithm to an episodic setting, and establish the global convergence of the extended algorithm and provide bounds on the number of iterations needed to achieve an error bound $\mathcal{O}(ε)$ in each episode.

Add feedback

Policy Gradient for Coherent Risk Measures

Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor

Neural Information Processing SystemsOct-1-2025, 23:18:20 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, coherent risk measure, risk measure, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)

Add feedback

PAC-Bayesian Bound for the Conditional Value at Risk

Neural Information Processing SystemsAug-16-2025, 13:39:52 GMT

The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function.

concentration inequality, inequality, risk measure, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Nevada (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Add feedback

d02e9bdc27a894e882fa0c9055c99722-Paper.pdf

Neural Information Processing SystemsAug-16-2025, 13:39:46 GMT

concentration inequality, random variable, risk measure, (14 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Nevada (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Bayesian Risk Markov Decision Processes

Neural Information Processing SystemsAug-15-2025, 17:27:03 GMT

Markov decision process (MDP) is a paradigm for modeling sequential decision making under uncertainty. From a modeling perspective, some parameters of MDPs are unknown and need to be estimated from data.

artificial intelligence, br-mdp, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Portugal > Braga > Braga (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Deng, Zilong, Khan, Simon, Zou, Shaofeng

arXiv.org Artificial IntelligenceMar-20-2025

In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, named Iterated CVaR. We first build a connection between Iterated CVaR RL with $(s, a)$-rectangular distributional robust RL with the specific uncertainty set for CVaR. We develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\overset{\sim}{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, then the sample complexity can be further improved to $\overset{\sim}{O}\left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of $\overset{\sim}{O} \left(\frac{(1-\gamma \tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For a constant risk level $0<\tau\leq 1$, our upper and lower bounds match with each other, demonstrating the tightness and optimality of our analyses.We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\overset{\sim}{O}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

arXiv.org Artificial Intelligence

2503.08934

Country:

North America > United States > Arizona (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Thailand (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

An Upper Confidence Bound Approach to Estimating the Maximum Mean

Kun, Zhang, Guangwu, Liu, Wen, Shi

arXiv.org Artificial IntelligenceAug-7-2024

Estimating the maximum mean finds a variety of applications in practice. In this paper, we study estimation of the maximum mean using an upper confidence bound (UCB) approach where the sampling budget is adaptively allocated to one of the systems. We study in depth the existing grand average (GA) estimator, and propose a new largest-size average (LSA) estimator. Specifically, we establish statistical guarantees, including strong consistency, asymptotic mean squared errors, and central limit theorems (CLTs) for both estimators, which are new to the literature. We show that LSA is preferable over GA, as the bias of the former decays at a rate much faster than that of the latter when sample size increases. By using the CLTs, we further construct asymptotically valid confidence intervals for the maximum mean, and propose a single hypothesis test for a multiple comparison problem with application to clinical trials. Statistical efficiency of the resulting point and interval estimates and the proposed single hypothesis test is demonstrated via numerical examples.

estimator, ga estimator, lsa estimator, (16 more...)

arXiv.org Artificial Intelligence

2408.04179

Country: